ABSTRACT

The recently-discovered polar codes are widely seen as a major
breakthrough in coding theory. These codes achieve the capacity of
many important channels under successive cancellation decoding.
Motivated by the rapid progress in the theory of polar codes, we
propose a family of architectures for efficient hardware implementation
of successive cancellation decoders. We show that such decoders can be
implemented with O(n) processing elements and O(n) memory elements,
while providing constant throughput. We also propose a technique for
overlapping the decoding of several consecutive codewords, thereby
achieving a significant speed-up factor. We furthermore show that
successive cancellation decoding can be implemented in the logarithmic
domain, thereby eliminating the multiplication and division operations
and greatly reducing the complexity of each processing element.

Index Terms — Polar codes, successive cancellation decoding, 
hardware implementation, VLSI. 

1. INTRODUCTION 

Polar codes [1] are a family of error-correcting codes with an explicit
construction and efficient encoding and decoding algorithms. Moreover,
they achieve capacity (asymptotically in the code length n) if the
underlying channel is symmetric, memoryless, and has binary input. To
date, no other family of codes possesses these attributes; hence polar
codes are seen as a major breakthrough in coding theory. Not
surprisingly, polar codes have recently garnered much interest in the
coding theory community. From a practical point of view, the capacity
of the channel can be approached at the expense of a large code length
($n = 2^{20}$ bits). In some information-theoretic applications, polar
codes are the only known solution that is both explicit and efficient:
for example, achieving the secrecy capacity of the wiretap channel in
the general case [2]. Polar codes have recently been shown to have an
efficient construction [3]. Recent results have started to address the
issue of long code length. It was shown in [4] that applying belief
propagation decoding to polar codes helps reduce the required code
length at the expense of extra complexity due to the iterative nature
of belief propagation. Driven by the recent rapid progress in the
theory of polar codes, our motivation is to find efficient hardware
architectures for successive cancellation (SC) decoding that will allow
high-throughput and low-area implementations. Despite the numerous
studies on polar code construction and performance, the hardware
implementation of SC decoders remains an open problem. Initial results
and a general framework for the implementation of belief propagation
decoders for polar codes are given in [5]. However, due to its lower
complexity compared to belief propagation, we are motivated to study
the hardware implementation of successive cancellation decoding. Arıkan
[1] showed that the SC decoding algorithm can be implemented in
complexity $O(n \log_2 n)$, where n is the code length.

In this paper, starting from the general framework proposed by Arıkan
[1], we show that SC decoding can actually be implemented with hardware
complexity O(n). We also propose to increase the throughput by decoding
several consecutive vectors at the same time. Finally, in order to
reduce the complexity further, we address the implementation of the
computational nodes by working in the logarithmic domain, thereby
eliminating the multiplication and division operations. We show that
the resulting transcendental functions can be approximated by the
minimum function with negligible performance degradation.

2. POLAR CODES 

Polar codes are linear block error-correcting codes. Assume from here
onward that the underlying channel has binary input, is symmetric, and
is memoryless. Fix $n = 2^m$ as the code length. Denote by
$u = (u_0, u_1, \ldots, u_{n-1})$ the input bits, and let
$c = (c_0, c_1, \ldots, c_{n-1})$ be the corresponding codeword$^1$. The
encoding operation has an FFT structure, depicted in Figure 1 for
m = 3. Note that the ordering of the $u_i$ in Figure 1 is according to
the bit-reversal order: if we reverse the order of the bits in the
binary representation of i, then we get the standard lexicographic
ordering.
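As a minimal sketch of this FFT-like structure, the Python fragment below applies a butterfly network of XOR operations stage by stage. It assumes the standard 2x2 polar kernel [[1, 0], [1, 1]] of [1] and natural index order; the names `polar_transform` and `bit_reverse` are illustrative and do not come from the paper. The bit-reversal order mentioned above can be obtained by permuting indices with `bit_reverse`.

```python
def bit_reverse(i, m):
    """Reverse the m-bit binary representation of index i."""
    r = 0
    for _ in range(m):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def polar_transform(u):
    """Apply the m-stage XOR butterfly to a list of n = 2**m bits.

    Assumption: kernel [[1, 0], [1, 1]], natural (non bit-reversed) order.
    """
    x = list(u)
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    step = 1
    while step < n:
        # Each stage combines pairs of positions that are `step` apart.
        for start in range(0, n, 2 * step):
            for j in range(start, start + step):
                x[j] ^= x[j + step]
        step *= 2
    return x
```

Over GF(2) this transform is its own inverse, so applying it twice returns the input bits; this gives a quick sanity check of an implementation.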

Recall that u is encoded to c. Next, c is sent over the underlying
channel (the channel is used n times). Denote by
$y = (y_0, y_1, \ldots, y_{n-1})$ the corresponding channel output. We
now wish to decode y. This is done by a successive cancellation
decoder. That is, given y, we first try to deduce the value of $u_0$,
then that of $u_1$, and so forth up until $u_{n-1}$. We do this as
follows. Assume that we are currently at stage i, and so we have
already guessed the values of $u_0, u_1, \ldots, u_{i-1}$; denote these
guesses as $\hat{u}_0, \hat{u}_1, \ldots, \hat{u}_{i-1}$, and write
$u_0^{i-1}$ and $\hat{u}_0^{i-1}$ for the corresponding vectors. Next,
for $b \in \{0, 1\}$, denote by $\Pr(y \mid \hat{u}_0^{i-1}, u_i = b)$
the probability that y was received, given that
$u_0^{i-1} = \hat{u}_0^{i-1}$, that $u_i = b$, and that
$u_{i+1}, u_{i+2}, \ldots, u_{n-1}$ are independent random variables
with Bernoulli distribution of parameter 1/2. If i is not in the frozen
set (explained later), then we take the guessed value $\hat{u}_i$ to be

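$$\hat{u}_i = \begin{cases} 0, & \text{if } \dfrac{\Pr(y \mid \hat{u}_0^{i-1}, u_i = 0)}{\Pr(y \mid \hat{u}_0^{i-1}, u_i = 1)} > 1, \\ 1, & \text{otherwise.} \end{cases} \qquad (1)$$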

Consider the case in which we are at stage i, and
$u_0^{i-1} = \hat{u}_0^{i-1}$; that is, we have guessed correctly up
until now. Then, as shown in [1], for almost all $0 \le i < n$ we have
that the probability of guessing $u_i$ correctly is either extremely
close to 1 (very good), or extremely
1 Note that n input bits are encoded to a length n codeword. However, as 
we will see later on, not all of the n input bits carry information. 




Fig. 1. Encoder architecture for n = 8 



close to 1/2 (very bad). That is, there is a polarization effect as n
tends to infinity. In order to keep the assumption
$u_0^{i-1} = \hat{u}_0^{i-1}$ valid for all i (with very high
probability), we freeze some $u_i$. That is, if the probability of
guessing $u_i$ is not very good, then we set its value to 0 in both the
encoder and the decoder, and thus no information is transmitted via it.
As shown in [1], the fraction of indices i which are not frozen (the
effective code rate) tends to the capacity of the underlying channel.

3. SUCCESSIVE CANCELLATION DECODER 
IMPLEMENTATION 

3.1. FFT structure 

Arıkan showed that SC decoding can be efficiently implemented by the
factor graph of the code, which has a structure resembling the Fast
Fourier Transform (FFT). In the remainder of the paper, we will
designate this decoder as the "FFT-like SC decoder". Figure 2 shows the
graph of the SC decoder for n = 8. Channel likelihood ratios (LRs)
$\lambda_i$ are assumed to be available on the right-hand side of the
graph while the estimated bits $\hat{u}_i$ are on the left-hand side.
The SC decoder is composed of $m = \log_2 n$ stages, each containing n
nodes. We refer to a specific node as $N_{l,j}$, where l designates the
stage index ($0 \le l < m$) and j designates the node index within
stage l ($0 \le j < n$). Each node updates its output according to one
of the two following update rules:

$$f(a, b) = \frac{1 + ab}{a + b}, \qquad g_{\hat{u}_s}(a, b) = a^{1 - 2\hat{u}_s}\, b. \qquad (2)$$

The values a and b are likelihood ratios, while $\hat{u}_s$ is a bit
that represents the partial modulo-2 sum of previously estimated bits.
For example, in node $N_{1,3}$, the partial sum is
$\hat{u}_s = \hat{u}_4 \oplus \hat{u}_5$. The value of $\hat{u}_s$
determines whether function g is a multiplication or a division. These
update rules are complex to implement in hardware since they involve
multiplications and divisions. In Section 4 we propose to perform these
operations in the logarithmic domain and to apply an approximation to
function f. For now, we will consider f and g to be black boxes until
we return to them in Section 4.

The sequential nature of the algorithm induces some data dependence
within the processing. We notice that $N_{1,2}$ cannot be updated
before the bit $\hat{u}_1$ is computed and, a fortiori, before
$\hat{u}_0$ is known. In order to respect the data dependence, a
scheduling has to be defined. Arıkan proposed two schedulings for this
decoding framework [1]. In the left-to-right scheduling, nodes
recursively call their predecessors until an updated node is reached.
The recursive nature of this scheduling is especially suitable for
software implementation. In the alternative right-to-left scheduling,
any node updates its value whenever its inputs are available. Each bit
$\hat{u}_i$ is successively estimated by activating the spanning tree
rooted at $N_{0,\pi(i)}$, where $\pi(i)$ is the bit-reversed index of i.
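The recursive flavor of SC decoding in software can be sketched as follows. This is a minimal illustration rather than the paper's hardware schedule: it works in the likelihood-ratio domain with the update rules f(a, b) = (1 + ab)/(a + b) and g(a, b) = a^(1 - 2u_s) b, decides 0 when the LR exceeds 1, and forces frozen bits to 0. It assumes natural index order (no bit reversal) and LRs defined as Pr(y | 0) / Pr(y | 1); the names `sc_decode` and `frozen` are hypothetical.

```python
def f(a, b):
    """LR update rule f from (2)."""
    return (1.0 + a * b) / (a + b)


def g(a, b, u_s):
    """LR update rule g from (2): a**(1 - 2*u_s) * b for u_s in {0, 1}."""
    return b / a if u_s else a * b


def sc_decode(lrs, frozen):
    """Recursively decode n = 2**m likelihood ratios.

    lrs    -- list of positive LRs, Pr(y|0) / Pr(y|1), one per code bit
    frozen -- list of booleans, True where the input bit is frozen to 0
    Returns (u_hat, x_hat): estimated input bits and their re-encoding,
    the latter serving as partial sums for the caller.
    """
    n = len(lrs)
    if n == 1:
        bit = 0 if (frozen[0] or lrs[0] > 1.0) else 1
        return [bit], [bit]
    half = n // 2
    upper, lower = lrs[:half], lrs[half:]
    # First half of the bits only needs the f rule.
    u_left, x_left = sc_decode(
        [f(a, b) for a, b in zip(upper, lower)], frozen[:half])
    # Second half uses g, steered by the partial sums of the first half.
    u_right, x_right = sc_decode(
        [g(a, b, s) for a, b, s in zip(upper, lower, x_left)], frozen[half:])
    x_hat = [p ^ q for p, q in zip(x_left, x_right)] + x_right
    return u_left + u_right, x_hat
```

For instance, with only the last of four bits unfrozen the code reduces to a length-4 repetition code, and the sketch corrects a single unreliable channel value.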




Fig. 2. FFT-like SC decoder architecture for n = 8

In Figure 2, the tree associated with $\hat{u}_0$ is highlighted. If we
assume that a pipeline register is inserted between each stage, or
equivalently that each node processor can memorize its updated value,
then some results can be reused. For example, in Figure 2, bit
$\hat{u}_1$ can be decoded by only activating $N_{0,4}$, since
$N_{1,0}$ and $N_{1,4}$ have already been updated during the decoding
of $\hat{u}_0$. Despite this well-defined structure and scheduling of
the FFT-like decoder, Arıkan does not address in [1] the problems of
resource sharing, memory management, or control generation that would
be required for hardware implementation. This framework, however,
suggests that it could be implemented with $n \log_2 n$ combinational
node processors together with n registers between each stage to
memorize intermediate results. In order to store the channel
information, n extra registers are included as well. The total
complexity of such a decoder is

$$C_T = (C_{np} + C_r)\, n \log_2 n + n C_r, \qquad (3)$$

where $C_{np}$ and $C_r$ are the hardware complexity of a node
processor and a register, respectively. It can be shown that such a
decoder with the right-to-left scheduling would take $2n - 2$ clock
cycles to decode n bits. The throughput in bits per second would then be



$$T = \frac{n}{(2n - 2)\, t_{np}} \approx \frac{1}{2 t_{np}}, \qquad (4)$$

where $t_{np}$ is the propagation time in seconds through a node
processor. It follows that every node processor is actually used only
once every $2n - 2$ clock cycles. This motivates us to find a schedule
that merges some of the nodes into a single processing element.

3.2. Pipelined tree architecture 

Looking further into the scheduling, we notice that whenever stage l is
activated, only $2^l$ nodes are actually updated. For example, in
Figure 2, when stage 0 is enabled, only one node is updated. Then the n
nodes of stage 0 can be implemented using a single processing element
(PE). In general, for stage l, $2^l$ PEs are sufficient to update the
nodes. However, this resource sharing does not necessarily guarantee
that the memories assigned to the merged nodes can also be merged. The
memory sharing
Fig. 3. Pipelined tree SC architecture for n = 8.

Table 1. Schedule for the FFT-like and pipelined tree SC architectures
(n = 8).



depends on the liveness of generated variables. Table 1 shows the stage
activation during the decoding of one vector y. When stage l is
enabled, we indicate which function (f or g) is applied to the $2^l$
activated nodes of stage $S_l$ during each clock cycle (CC). Every
generated variable is used twice during the decoding. For example, the
four variables generated in stage 2 at CC #1 are used at CC #2 and
CC #5 in stage 1. This means that in stage 2, the four registers
associated with the f function can be reused at CC #8 to memorize the
four data values generated by the g function. This observation is
applicable to any stage in the decoder. The resulting proposed
architecture is shown in Figure 3 for n = 8. The LRs $\lambda_i$ are
stored in n registers. The decoder is composed of a pipelined tree
structure that includes $n - 1$ PEs, $P_{l,j}$, and $n - 1$ registers,
$R_{l,j}$, with $0 \le l \le m - 1$ and $0 \le j < 2^l$. A decision
unit generates the estimated bit $\hat{u}_i$, which is then broadcast
back to every PE. A PE is a configurable element that can perform
either function f or g. It also includes the $\hat{u}_s$ computation
block that updates the $\hat{u}_s$ value with the last decoded bit
$\hat{u}_i$ only if the control bit $b_{l,j} = 1$. Another control bit
$b_l$ is used to select function f or g. Compared to the FFT-like
structure, the pipelined tree architecture performs the same amount of
computation with the same scheduling (see Table 1) but with a reduced
number of PEs and registers. Assuming that a PE (implementing f and g)
represents twice the complexity of a node processor that implements a
single f or g function, the pipelined tree decoder complexity is

$$C_T = (n - 1)(2 C_{np} + C_r) + n C_r. \qquad (5)$$

Moreover, one can notice that the routing network in the decoder is 
much simpler in the tree architecture than in the FFT-like structure. 
Connections between PEs are also local. This lowers the risk of con- 
gestion during the wire routing phase of an integrated circuit design 
and potentially increases the clock frequency and the throughput. 
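The stage-activation pattern described for Table 1 can be regenerated with a short recursive sketch. It assumes one stage activation per clock cycle and follows the facts stated in the text (stage 2 applies f at CC #1 and g at CC #8, stage 1 is active at CC #2 and CC #5, and a vector takes 2n - 2 cycles). The function name `sc_schedule` is illustrative.

```python
def sc_schedule(m):
    """Return one (stage, function) pair per clock cycle for n = 2**m.

    Each stage first applies f, lets the lower stages finish, then
    applies g and lets the lower stages run once more.
    """
    def visit(stage):
        for func in ("f", "g"):
            yield (stage, func)
            if stage > 0:
                yield from visit(stage - 1)

    return list(visit(m - 1))


if __name__ == "__main__":
    for cc, (stage, func) in enumerate(sc_schedule(3), start=1):
        print(f"CC #{cc}: stage {stage} applies {func}")
```

Listing index k of the result corresponds to clock cycle k + 1, so stage l appears 2^(m-l) times in the list.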

3.3. Line SC Architecture 

Despite the low complexity of the pipelined tree architecture, it is
possible to further reduce the number of PEs. Looking at Table 1,




Fig. 4. Line SC architecture for n = 8. 



it appears that only one stage is activated at a time. In the worst
case (activation of stage $m - 1$), $n/2$ PEs have to be activated at
the same time. This means that the same throughput can be achieved with
only $n/2$ PEs. The resulting architecture is shown in Figure 4 for
n = 8. The processing elements $P_j$ are arranged in a line while the
registers keep a tree structure. Registers and PEs are connected via
multiplexing resources that emulate the tree structure. For example,
since $P_{2,0}$ and $P_{1,0}$ (in Figure 3) are merged into $P_2$ (in
Figure 4), $P_2$ should write either to $R_{2,0}$ or to $R_{1,0}$,
while it should also be able to read from the channel registers or from
$R_{2,0}$ and $R_{2,1}$. The $\hat{u}_s$ computation block is moved out
of $P_j$ and kept close to the associated register because $\hat{u}_s$
should also be forwarded to the PE. The overall complexity of the line
SC architecture is

$$C_T = (n - 1)(C_r + C_{\hat{u}_s}) + n C_{np} + \left(\frac{n}{2} - 1\right) 3 C_{mux} + n C_r, \qquad (6)$$

where $C_{mux}$ represents the complexity of a 2-input multiplexer and
$C_{\hat{u}_s}$ is the complexity of the $\hat{u}_s$ computation block.
Despite the extra multiplexing logic required to route the data through
the PE line, the savings in the number of PEs make this SC decoder less
complex than the pipelined tree architecture while achieving the same
throughput as computed in (4).
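To get a feel for the three complexity expressions, the sketch below evaluates the FFT-like, pipelined tree, and line totals for given unit costs. The formulas are the ones read from (3), (5), and (6); the unit costs in the example are arbitrary placeholders chosen for illustration, not figures from the paper, and the function names are hypothetical.

```python
def fft_like_complexity(n, c_np, c_r):
    """Total complexity of the FFT-like decoder, as in (3)."""
    m = n.bit_length() - 1  # log2(n) for n a power of two
    return (c_np + c_r) * n * m + n * c_r


def tree_complexity(n, c_np, c_r):
    """Total complexity of the pipelined tree decoder, as in (5)."""
    return (n - 1) * (2 * c_np + c_r) + n * c_r


def line_complexity(n, c_np, c_r, c_us, c_mux):
    """Total complexity of the line decoder, as in (6)."""
    return ((n - 1) * (c_r + c_us) + n * c_np
            + (n // 2 - 1) * 3 * c_mux + n * c_r)


if __name__ == "__main__":
    # Placeholder costs: a node processor assumed 10x a register.
    n, c_np, c_r, c_us, c_mux = 8, 10, 1, 1, 1
    print(fft_like_complexity(n, c_np, c_r))
    print(tree_complexity(n, c_np, c_r))
    print(line_complexity(n, c_np, c_r, c_us, c_mux))
```

The ordering between the tree and line totals depends on the relative costs: the line architecture wins when the node processor dominates the multiplexer and partial-sum costs, as in the placeholder values above.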

It is possible to further reduce the number of PEs with only a small
penalty in terms of throughput. Looking at Table 1, during the decoding
of one vector, stage l is activated $2^{m-l}$ times. Consequently, in
the line architecture of Figure 4, all $n/2$ PEs are activated at the
same time only twice during the decoding of a vector, regardless of the
code size. A decoder with only $n/4$ PEs would require only 2 extra
clock cycles to decode a vector. Such a semi-parallel architecture
would improve the hardware efficiency at the cost of only a small
decrease in throughput.
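The cycle counts quoted above can be checked with a small sketch. It assumes that a stage with 2^l nodes mapped onto a given number of PEs needs ceil(2^l / PEs) clock cycles per activation, and that there is no other overhead; this cost model and the name `decoding_cycles` are assumptions made for illustration.

```python
def decoding_cycles(n, num_pes):
    """Clock cycles to decode one vector of length n = 2**m.

    Stage l has 2**l nodes and is activated 2**(m - l) times; each
    activation is assumed to take ceil(2**l / num_pes) cycles.
    """
    m = n.bit_length() - 1
    total = 0
    for stage in range(m):
        nodes = 2 ** stage
        activations = 2 ** (m - stage)
        total += activations * -(-nodes // num_pes)  # ceiling division
    return total


if __name__ == "__main__":
    n = 1024
    for pes in (n // 2, n // 4, n // 8):
        print(pes, decoding_cycles(n, pes))
```

With n/2 PEs the count is 2n - 2, and halving the number of PEs adds two cycles under this model, independently of n.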

The Line SC architecture can be seen as a tree architecture in which 
complexity is reduced by merging some of the PEs. An alternative 
would be to start from the same tree architecture and use the idle 
stages to overlap the decoding of several codewords at once, enhanc- 
ing the throughput. 

3.4. Vector-overlapping SC architecture 

Let us assume that we want to use idle cycles in the pipelined tree
architecture in order to overlap the decoding of P vectors y. At CC #1,
$y_1$ is fed into stage 2 of the pipelined tree decoder. At CC #2, a
second vector $y_2$ is shifted into stage 2 while $y_1$ uses stage 1.
At CC #3, $y_1$ and $y_2$ are in stages 0 and 1, respectively. Then, a
PE conflict occurs at CC #4 when both $y_1$ and $y_2$ need to access
stage 0. This problem can be overcome by simply duplicating stage 0 so
that no resource conflict happens. As shown in Table 2, by duplicating
Table 2. Schedule for the vector-overlapping SC architecture (n = 8
and P = 3).

Fig. 5. Vector-overlapping SC decoder for n = 8 and P = 3.



stage 0 (denoted as $S_{0d}$), it is possible to overlap up to 3
vectors at the same time. It would actually be possible to insert
another vector by using the remaining idle resources, but the routing
of data across the tree would lose its regularity, making the
multiplexing design more complex. Since several vectors are decoded at
the same time, each PE should have access to the registers associated
with each vector. This means that P register sets are required to
decode P vectors in parallel. A vector-overlapping SC decoder is shown in
Figure 5 for n = 8 and P = 3. The degree of parallelism P can actually
be enhanced by further duplicating PE stages. It can be shown that in
order to reach parallelism P, each stage l should be duplicated
$\lceil (P+1)/2^{l+1} \rceil$ times (counting the original). This
vector-overlapping architecture allows us to reach a maximum
parallelism value of $P = n - 1$. The complexity and the throughput of
a vector-overlapping SC architecture with parallelism P are



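$$C_T = \left[ n + \frac{P+1}{2} \left( \log_2 \frac{P+1}{2} - 1 \right) \right] 2 C_{np} + P(2n - 1) C_r, \qquad (7)$$

and

$$T = \frac{P}{2 t_{np}}. \qquad (8)$$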



This architecture provides a solution to enhance the parallelism of 
the decoder without duplicating all the resources of the decoder. 
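The stage-0 conflict described at the start of this section can be reproduced with a short sketch. It assumes that each vector follows the single-vector stage sequence (stage m-1 first, then down to stage 0, with an f pass and a g pass per stage) and that consecutive vectors enter one clock cycle apart; only one batch of vectors is examined, not the steady state. The names `stage_sequence` and `peak_stage_demand` are illustrative.

```python
from collections import Counter


def stage_sequence(m):
    """Stage visited at each clock cycle while decoding one vector."""
    def visit(stage):
        for _ in range(2):  # f pass, then g pass
            yield stage
            if stage > 0:
                yield from visit(stage - 1)

    return list(visit(m - 1))


def peak_stage_demand(m, num_vectors):
    """Largest number of vectors needing each stage in the same cycle,
    assuming vector v enters the decoder v cycles after vector 0."""
    seq = stage_sequence(m)
    peak = Counter()
    for cycle in range(len(seq) + num_vectors - 1):
        busy = Counter(
            seq[cycle - v]
            for v in range(num_vectors)
            if 0 <= cycle - v < len(seq)
        )
        for stage, count in busy.items():
            peak[stage] = max(peak[stage], count)
    return dict(peak)
```

For n = 8, the result for two or three overlapped vectors shows that only stage 0 is ever requested by two vectors at once, which is consistent with duplicating stage 0 alone.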

4. MINIMUM APPROXIMATION 



SC decoding, in its original version, was proposed in the likelihood
ratio domain, in which the update rules f and g require multiplication
and division. The hardware implementation of multipliers and dividers
is very expensive and is usually avoided in practical decoder designs.
We propose to perform SC decoding in the log domain in order to reduce
the complexity of the f and g computation blocks. We assume that the
channel information is available as log-likelihood ratios (LLRs)
$L_i$. In the LLR domain, f and g become



$$f(L_a, L_b) = 2 \tanh^{-1}\!\left( \tanh\frac{L_a}{2}\, \tanh\frac{L_b}{2} \right), \qquad g_{\hat{u}_s}(L_a, L_b) = L_a (-1)^{\hat{u}_s} + L_b, \qquad (9)$$



where $L_a$ and $L_b$ are LLRs. In terms of hardware implementation, g
can be easily mapped to an adder/subtractor controlled by

Table 3. Comparison of SC decoder architectures.



the bit $\hat{u}_s$. However, f involves transcendental functions that
are complex to implement in hardware. One can notice that the f and g
functions are identical to the update rules used in BP decoding of LDPC
codes. Consequently, similar to what is done in LDPC decoder
implementations [6], f can be approximated with the minimum function
such that



$$f(L_a, L_b) \approx \operatorname{sign}(L_a)\, \operatorname{sign}(L_b)\, \min(|L_a|, |L_b|). \qquad (10)$$



In order to estimate the performance degradation incurred by this ap- 
proximation we simulated the performance of different polar codes 
on an AWGN channel with BPSK modulation. There was no signif- 
icant performance loss. 
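The LLR-domain rules and the minimum approximation can be compared numerically with the sketch below. It is a floating-point illustration only, not a model of the fixed-point hardware; the function names `f_exact`, `f_min`, and `g` are chosen here for convenience.

```python
import math


def f_exact(l_a, l_b):
    """Exact LLR-domain f rule from (9).

    Note: for very large |LLR| the tanh product rounds to +/-1 and
    math.atanh raises ValueError; a practical version would clip it.
    """
    return 2.0 * math.atanh(math.tanh(l_a / 2.0) * math.tanh(l_b / 2.0))


def f_min(l_a, l_b):
    """Minimum approximation of f from (10)."""
    sign = math.copysign(1.0, l_a) * math.copysign(1.0, l_b)
    return sign * min(abs(l_a), abs(l_b))


def g(l_a, l_b, u_s):
    """LLR-domain g rule from (9): an adder/subtractor selected by u_s."""
    return l_b - l_a if u_s else l_b + l_a


if __name__ == "__main__":
    for l_a, l_b in [(2.0, -3.0), (0.5, 0.4), (6.0, 5.0)]:
        print(l_a, l_b, f_exact(l_a, l_b), f_min(l_a, l_b))
```

The approximation keeps the sign of the exact rule and never underestimates its magnitude, which is why it is attractive when only comparators are affordable.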

5. CONCLUSION 

In this paper we showed that the architecture proposed by Arıkan in [1]
can be improved by taking advantage of the scheduling in SC decoding.
Table 3 is a comparison of the complexity and throughput of the
FFT-like SC decoder with the proposed architectures. The pipelined tree
architecture and the line architecture allow us to reach the same
throughput while reducing the hardware complexity. We also showed that
throughput can be enhanced by decoding several vectors in parallel in a
vector-overlapping architecture.

In this paper, we investigated fully-parallel architectures for SC
decoders. For very large code lengths, it would be necessary to
consider semi-parallel architectures in which PEs are shared within the
update phase of the same stage, as suggested in Section 3.3. The very
regular structure of polar codes makes semi-parallel architectures
straightforward to implement.